Scene Dynamics
Reviews: Generating Videos with Scene Dynamics
Overall, this paper is very clearly laid out and easy to follow. Since the authors base much of their method on existing image-generation methods, the novelty lies in how they adapt those methods to generate video. It is worth emphasizing that I am not aware of any other papers that attempt this (and the authors also do not appear to have found any). The difficulty with video, unlike images, is that low frequencies span not only space but also time. As a result, when generating video, typical methods attempt to generate the temporal low frequencies first, which yields very jarring outputs.
ManiGaussian: Dynamic Gaussian Splatting for Multi-task Robotic Manipulation
Lu, Guanxing, Zhang, Shiyi, Wang, Ziwei, Liu, Changliu, Lu, Jiwen, Tang, Yansong
Performing language-conditioned robotic manipulation tasks in unstructured environments is in high demand for general intelligent robots. Conventional robotic manipulation methods usually learn semantic representations of the observation for action prediction, which ignores the scene-level spatiotemporal dynamics needed to accomplish human goals. In this paper, we propose a dynamic Gaussian Splatting method named ManiGaussian for multi-task robotic manipulation, which mines scene dynamics via future scene reconstruction. Specifically, we first formulate a dynamic Gaussian Splatting framework that infers semantic propagation in the Gaussian embedding space, where the semantic representation is leveraged to predict the optimal robot action. Then, we build a Gaussian world model to parameterize the distribution in our dynamic Gaussian Splatting framework, which provides informative supervision in the interactive environment via future scene reconstruction. We evaluate ManiGaussian on 10 RLBench tasks with 166 variations, and the results demonstrate that our framework outperforms state-of-the-art methods by 13.1% in average success rate.
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)
- Asia > China > Guangdong Province > Shenzhen (0.04)
- Research Report > Promising Solution (0.48)
- Research Report > New Finding (0.48)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Robots (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.54)
- Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.39)
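The future-scene-reconstruction supervision described in the ManiGaussian abstract can be illustrated with a toy sketch. Everything here is a hypothetical simplification I introduce for illustration, not the paper's implementation: Gaussians are reduced to 2-D point means, and the learned Gaussian world model is stood in for by a fixed linear propagation under the action.

```python
# Toy sketch of future scene reconstruction as a supervision signal
# (hypothetical simplification of the idea, not the paper's method).

def propagate_gaussians(means, action, step=1.0):
    """Predict next-frame Gaussian means given a 2-D action velocity."""
    ax, ay = action
    return [(x + step * ax, y + step * ay) for x, y in means]

def reconstruction_loss(predicted, observed):
    """Mean squared error between predicted and observed future means."""
    n = len(predicted)
    return sum((px - ox) ** 2 + (py - oy) ** 2
               for (px, py), (ox, oy) in zip(predicted, observed)) / n

means = [(0.0, 0.0), (1.0, 1.0)]           # current scene Gaussians
action = (0.5, -0.5)                       # robot action (a velocity here)
pred = propagate_gaussians(means, action)  # world model's future guess
obs = [(0.5, -0.5), (1.5, 0.5)]            # observed future scene
loss = reconstruction_loss(pred, obs)      # 0.0: prediction matches future
```

In the actual framework this loss would flow back into the learned world model and the Gaussian embeddings; the toy version only shows where the supervision comes from.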
Deep Learning Method for Cell-Wise Object Tracking, Velocity Estimation and Projection of Sensor Data over Time
Braun, Marco, Luszek, Moritz, Meuter, Mirko, Spata, Dominic, Kollek, Kevin, Kummert, Anton
Current Deep Learning methods for environment segmentation and velocity estimation rely on Convolutional Recurrent Neural Networks to exploit spatio-temporal relationships within obtained sensor data. These approaches derive scene dynamics implicitly by correlating novel input and memorized data utilizing ConvNets. We show how ConvNets suffer from architectural restrictions for this task. Based on these findings, we then provide solutions to various issues on exploiting spatio-temporal correlations in a sequence of sensor recordings by presenting a novel Recurrent Neural Network unit utilizing Transformer mechanisms. Within this unit, object encodings are tracked across consecutive frames by correlating key-query pairs derived from sensor inputs and memory states, respectively. We then use resulting tracking patterns to obtain scene dynamics and regress velocities. In a last step, the memory state of the Recurrent Neural Network is projected based on extracted velocity estimates to resolve aforementioned spatio-temporal misalignment.
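The key-query correlation step this abstract describes is standard dot-product attention between the new input and the recurrent memory. A minimal sketch, under toy assumptions of my own (2-D encodings, one object per slot; the real unit operates on feature maps inside an RNN cell):

```python
# Minimal sketch of correlating queries (from the current frame) with
# keys (from the memory state) to match object encodings across frames.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def correlate(queries, keys):
    """For each input query, attention weights over memory keys."""
    return [softmax([sum(qi * ki for qi, ki in zip(q, k)) for k in keys])
            for q in queries]

memory_keys = [[1.0, 0.0], [0.0, 1.0]]   # object encodings from memory
input_queries = [[5.0, 0.0]]             # encoding from the current frame
weights = correlate(input_queries, memory_keys)
# weights[0][0] is close to 1: the current object matches memory slot 0
```

The resulting attention pattern is exactly the kind of "tracking pattern" from which the paper regresses per-object velocities.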
Unifying (Machine) Vision via Counterfactual World Modeling
Bear, Daniel M., Feigelis, Kevin, Chen, Honglin, Lee, Wanhee, Venkatesh, Rahul, Kotar, Klemen, Durango, Alex, Yamins, Daniel L. K.
Leading approaches in machine vision employ different architectures for different tasks, trained on costly task-specific labeled datasets. This complexity has held back progress in areas such as robotics, where robust task-general perception remains a bottleneck. In contrast, "foundation models" of natural language have shown how large pre-trained neural networks can provide zero-shot solutions to a broad spectrum of apparently distinct tasks. Here we introduce Counterfactual World Modeling (CWM), a framework for constructing a visual foundation model: a unified, unsupervised network that can be prompted to perform a wide variety of visual computations. CWM has two key components, which resolve the core issues that have hindered application of the foundation model concept to vision. The first is structured masking, a generalization of masked prediction methods that encourages a prediction model to capture the low-dimensional structure in visual data. The model thereby factors the key physical components of a scene and exposes an interface to them via small sets of visual tokens. This in turn enables CWM's second main idea -- counterfactual prompting -- the observation that many apparently distinct visual representations can be computed, in a zero-shot manner, by comparing the prediction model's output on real inputs versus slightly modified ("counterfactual") inputs. We show that CWM generates high-quality readouts on real-world images and videos for a diversity of tasks, including estimation of keypoints, optical flow, occlusions, object segments, and relative depth. Taken together, our results show that CWM is a promising path to unifying the manifold strands of machine vision in a conceptually simple foundation.
- North America > United States > California > Santa Clara County > Stanford (0.04)
- Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
- Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)
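Counterfactual prompting, as described in the CWM abstract, amounts to running a prediction model twice (once on the real input, once on a slightly edited one) and reading structure out of the difference. A toy sketch with an entirely hypothetical stand-in predictor (a fixed 1-D shift, where CWM would use a learned masked-prediction model):

```python
# Toy illustration of counterfactual prompting: compare predictions on
# real vs. perturbed inputs to obtain a zero-shot "readout".

def toy_predictor(frame, flow=1):
    """Stand-in prediction model: shifts a 1-D 'image' right by `flow`."""
    return frame[-flow:] + frame[:-flow]

def counterfactual_readout(frame, perturb_index):
    """Localize the influence of one input element by differencing the
    model's outputs on the real and counterfactual inputs."""
    real = toy_predictor(frame)
    counterfactual = frame[:]
    counterfactual[perturb_index] += 10.0   # the counterfactual edit
    changed = toy_predictor(counterfactual)
    return [abs(a - b) for a, b in zip(changed, real)]

frame = [0.0, 1.0, 2.0, 3.0]
diff = counterfactual_readout(frame, perturb_index=0)
# diff is nonzero only where the edited element lands after prediction,
# revealing the model's implicit "flow" without any flow supervision
```

The same differencing pattern, applied to a strong learned predictor, is how the paper extracts keypoints, optical flow, segments, and depth zero-shot.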
Generating Videos with Scene Dynamics
Vondrick, Carl, Pirsiavash, Hamed, Torralba, Antonio
We capitalize on large amounts of unlabeled video in order to learn a model of scene dynamics for both video recognition tasks (e.g. action classification) and video generation tasks (e.g. future prediction). We propose a generative adversarial network for video with a spatio-temporal convolutional architecture that untangles the scene's foreground from the background. Experiments suggest this model can generate tiny videos up to a second long at full frame rate better than simple baselines, and we show its utility at predicting plausible futures of static images. Moreover, experiments and visualizations show the model internally learns useful features for recognizing actions with minimal supervision, suggesting scene dynamics are a promising signal for representation learning. We believe generative video models can impact many applications in video understanding and simulation.
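The foreground/background untangling in this generator follows a two-stream composition: each output frame is a per-pixel blend, video(t) = m(t) * fg(t) + (1 - m(t)) * bg, with a static background and a time-varying foreground and mask. A sketch under toy assumptions (1-D "frames" in place of the real spatio-temporal tensors, hand-picked mask and foreground in place of generator outputs):

```python
# Sketch of the two-stream composition: blend a moving foreground over
# a static background using a per-pixel mask.

def compose_frame(mask, foreground, background):
    return [m * f + (1 - m) * b
            for m, f, b in zip(mask, foreground, background)]

background = [0.2, 0.2, 0.2, 0.2]           # static: one per video
video = []
for t in range(3):                          # foreground/mask vary in time
    mask = [1.0 if i == t else 0.0 for i in range(4)]
    foreground = [0.9] * 4
    video.append(compose_frame(mask, foreground, background))
# frame t shows the foreground value 0.9 at position t, background elsewhere
```

Keeping the background stream constant across time is what forces the network to put motion into the foreground stream, which is the untangling the abstract refers to.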
Generating Videos with Scene Dynamics - MIT
Learning models that generate videos may also be a promising way to learn representations. For example, we can train generators on a large repository of unlabeled videos, then fine-tune the discriminator on a small labeled dataset in order to recognize some actions with minimal supervision. We can also visualize what emerges in the representation for predicting the future. While not all units are semantic, we found there are a few hidden units that fire on objects that are sources of motion, such as people or train tracks. Since generating the future requires understanding moving objects, the network may learn to recognize these objects internally, even though it is not supervised to do so.
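The transfer recipe above (pretrain adversarially on unlabeled video, then fit a small head on the discriminator's features) can be sketched as follows. This is a hypothetical toy: the "frozen discriminator features" are stood in for by two hand-written statistics, and the fine-tuned head is a linear scorer with hand-picked weights.

```python
# Toy sketch of reusing pretrained features with a small fine-tuned head.

def pretrained_features(clip):
    """Stand-in for frozen discriminator features: mean and temporal range."""
    return [sum(clip) / len(clip), max(clip) - min(clip)]

def predict(clip, head_weights, bias):
    """Linear head on frozen features; score > 0 means 'action present'."""
    feats = pretrained_features(clip)
    return sum(w * f for w, f in zip(head_weights, feats)) + bias

# Only head_weights/bias would be fit on the small labeled set; here they
# are chosen by hand so that temporal variation signals an action.
head_weights, bias = [0.0, 1.0], -0.5
moving_clip = [0.0, 1.0, 0.0, 1.0]    # large temporal range -> "action"
static_clip = [0.5, 0.5, 0.5, 0.5]    # zero range -> "no action"
```

The point of the sketch is the division of labor: the expensive representation comes for free from unsupervised video generation, and only the tiny head needs labels.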